Speech recognition device, speech recognition method, computer-executable program for causing computer to execute recognition method, and storage medium

ABSTRACT

A speech recognition device and method configured to include a computer, for recognizing speech, including: a storage location for storing a feature quantity acquired from a speech signal for each frame; storage portions for storing acoustic model data and language model data; an echo speech component for generating echo speech model data from a speech signal acquired prior to a speech signal to be processed at the current time point and using the echo speech model data to generate adapted acoustic model data; and a processing component for utilizing the feature quantity, the adapted acoustic model data, and the language model data to provide a speech recognition result of the speech signal.

FIELD OF THE INVENTION

The present invention relates to speech recognition by a computer device, and in particular to a speech recognition device for sufficiently recognizing an original speech even when the original speech is superimposed with an echo generated by the environment, a speech recognition method, a computer-executable program for causing a computer to execute the recognition method, and a storage medium.

BACKGROUND OF THE INVENTION

As controllability of peripheral devices by a computer device has been improved, systems for automatically recognizing speech inputted from a microphone and the like are desirable. Such a speech recognition device can be assumed to be utilized for various applications such as dictation of a document, transcription of minutes of a meeting, interaction with a robot, and control of an external machine. The speech recognition device essentially analyzes inputted speech to acquire a feature quantity, selects a word corresponding to the speech based on the acquired feature quantity, and thereby causes a computer device to recognize the speech. Various methods have been proposed to exclude influence from the environment, such as background noises, in performing speech recognition. A typical example is a method in which a user is required to use a hand microphone or a head-set type microphone in order to exclude echoes or noises which may be superimposed on the speech to be recorded and to acquire only the inputted speech. In such a method, a user is required to use extra hardware that is not ordinarily used.

One reason that a user is required to use the above-mentioned hand microphone or head-set type microphone is that, if the speaker speaks away from a microphone, an echo may be generated depending on the environment, in addition to the influence of environmental noises. If an echo is superimposed onto a speech signal, in addition to noises, a mismatch is caused in the statistical model for each speech used in speech recognition (e.g., the hidden Markov model), which results in degradation of recognition efficiency.

FIG. 9 shows a typical method in which noises are taken into consideration when performing speech recognition. As shown in FIG. 9, if there is a noise, an inputted signal has a speech signal and output probability distribution in which the speech signal is superimposed with a noise signal. Since, in many cases, a noise occurs suddenly, a method is employed in which a microphone for acquiring an input signal and a microphone for acquiring a noise are used and, with the use of a so-called two-channel signal, a speech signal and a noise signal are separately acquired from the input signal. A traditional speech signal shown in FIG. 9 is acquired from a first channel, and a noise signal is acquired from a second channel, so that, with the use of a two-channel signal, an original speech signal can be recognized from an inputted speech signal even under a noisy environment.

However, hardware resources of a speech recognition device are consumed by use of data for two channels, and in addition, a two-channel input may not be available in some cases. Therefore, the above method does not always enable efficient recognition. Furthermore, the requirement that information of the two channels always be available simultaneously may inconveniently restrict practical speech recognition.

Conventionally, as a method for coping with influence from a speech transfer route, the cepstrum mean subtraction (CMS) method has been employed. A known disadvantage is that the CMS method is effective when the impulse response of a transfer characteristic is relatively short (several milliseconds to several dozen milliseconds), such as the case of influence of a telephone line, but is not sufficiently effective when the impulse response of a transfer characteristic is longer (several hundred milliseconds), such as the case of an echo in a room. The reason for the disadvantage is that the length of the transfer characteristic of an echo in a room is generally longer than the window width (10 msec-40 msec) for a short-interval analysis used for speech recognition, and therefore the impulse response is not stable in the analysis interval.

As an echo suppression method in which short-interval analysis is not employed, there has been proposed a method in which multiple microphones are used and an inverse filter is designed to exclude echo components from a speech signal (M. Miyoshi and Y. Kaneda, “Inverse Filtering of room acoustics,” IEEE Trans. on ASSP, Vol. 36, No. 2, pp. 145-152, 1988). This method has a disadvantage that the impulse response of an acoustic transfer characteristic may not be in the minimum phase, and therefore it is difficult to design a realistic inverse filter. Furthermore, multiple microphones often may not be installed because of the cost and physical arrangement condition, depending on the intended use environment.

As a method for coping with an echo, various methods have been proposed such as an echo canceller disclosed in Published Unexamined Patent Application No. 2002-152093, for example. However, these methods require speech to be inputted with two channels and are not capable of coping with an echo encountered with one-channel speech input. As an echo canceller technique, the method and the device described in Published Unexamined Patent Application No. 9-261133 are known. However, the echo processing method disclosed in Published Unexamined Patent Application No. 9-261133 is not a generalized method because it requires speech measurement at multiple places under the same echo environment.

As for speech recognition in which environmental noises are taken into consideration, it is possible to cope with noises using a method, such as a method of recognizing a speech under sudden noises by selecting an acoustic model for each frame, which is disclosed in Patent Application Specification No. 2002-72456 attributed to the common applicant, for example. However, an effective method related to speech recognition, which effectively utilizes the characteristic not of a suddenly generated noise but of an echo generated depending on an environment, has not been known.

A method of predicting an intra-frame transfer characteristic H to feed it back for speech recognition has been reported by T. Takiguchi, et al. (“HMM-Separation-Based Speech Recognition for a Distant Moving Speaker”, IEEE Trans. on SAP, Vol. 9, No. 2, pp. 127-140, 2001), for example. In this method, a transfer characteristic H in a frame is used to reflect the influence of an echo; a speech input is inputted via a head-set type microphone as a reference signal; an echo signal is separately measured; and then, based on the result of the two-channel measurement, an echo prediction coefficient α for predicting an echo is acquired. Although a case is also shown in which echo influence is not taken into consideration at all, it is shown that the above method by Takiguchi et al. can perform speech recognition with a sufficiently high accuracy in comparison with processing by a CMS method; however, this method does not enable speech recognition only from a speech signal measured in a hands-free environment.

If a user who does not use his hands or a user in an environment where a head-set type microphone cannot be carried or worn is able to perform speech recognition, availability of speech recognition can be considerably extended. Furthermore, though the existing techniques described above are known, availability of speech recognition can be further extended if the speech recognition accuracy can be further improved in comparison with the existing techniques. For example, the above-mentioned environments include a case where processing is performed based on speech recognition when driving a vehicle or piloting a plane, or during movement within a large space, and a case where speech is inputted into a notebook-type personal computer or a microphone located at a distance for a kiosk device.

As described above, at least use of a head-set type microphone or a hand microphone is assumed in traditional speech recognition methods. However, with miniaturization of computer devices and expansion of applications, there is an increasing demand for a speech recognition method to be used in an environment where echoes must be taken into consideration and an increasing demand for enabling a hands-free speech recognition function even in an environment where echoes may be generated. In the present invention, the term “hands-free” is used to mean a condition in which a speaker can speak at any position without restriction by the position of a microphone.

SUMMARY OF THE INVENTION

The present invention has been made in consideration of the above-mentioned disadvantages of the conventional speech recognition. In the present invention, there is provided a method for coping with influence of an echo in a room by adapting an acoustic model used in speech recognition (hidden Markov model) to a speech signal in an echo environment. In the present invention, the influence of echo components in a short-interval analysis is estimated using a signal observed for input from one microphone (one channel). This method does not require an impulse response to be measured in advance, and enables echo components to be estimated based on the maximum likelihood estimation utilizing an acoustic model by using only a speech signal spoken at any place.

The present invention has been made based on the idea that it is possible to perform sufficient speech recognition not by actually measuring a speech signal superimposed with an echo or a noise (hereinafter referred to as a “speech model affected by intra-frame echo influence” in the present invention) with the use of a head-set type microphone or a hand microphone, but by expressing it with an acoustic model used for speech recognition to estimate an echo prediction coefficient based on the maximum likelihood criterion.

When an echo is superimposed, the inputted speech signal and the acoustic model are different only by the echo. The present invention has been made based on the finding that, in consideration of the long impulse response, an echo can be sufficiently simulated even if the echo is assumed to be superimposed onto a speech signal O(ω; t), which is being determined at the current time point, while being dependent on a speech signal O(ω; tp) in a frame in the past. In the present invention, an echo can be defined as an acoustic signal which influences a speech signal for a longer time than an impulse response, the signal which gives the echo being a speaking voice giving the speech signal. Though it is not required to define an echo more clearly in the present invention, when seen in connection with the time width of an observation window to be used, it can be basically defined as an acoustic signal which gives influence longer than the time width of the observation window.

In this case, acoustic model data (an HMM parameter and the like), which is usually used as an acoustic model, can be regarded as a reference signal with a high accuracy, related to a phoneme generated with a speech corpus and the like. A transfer function H in a frame can be predicted with sufficient accuracy based on an existing technique. In the present invention, a “speech model affected by intra-frame echo influence” equivalent to a signal which has been conventionally inputted separately as a reference signal is generated from an acoustic model with the use of additivity of a cepstrum. Furthermore, an echo prediction coefficient α can be estimated so that a selected speech signal is given the maximum probability. The echo prediction coefficient is used to generate an adapted acoustic model which has been adapted to an environment to be used by a user, in order to perform speech recognition. According to the present invention, speech input as a reference signal is not required, and it is possible to perform speech recognition using only a speech signal from one channel. Furthermore, according to the present invention, it is possible to provide a robust speech recognition device and speech recognition method to cope with an echo influence problem which may be caused when a speaker speaks away from a microphone.

That is, according to the present invention, there is provided a speech recognition device configured to include a computer, for recognizing a speech; the speech recognition device comprising: a storage area for storing a feature quantity acquired from a speech signal for each frame; storing portions for storing acoustic model data and language model data, respectively; an echo adaptation model generating portion for generating echo speech model data from a speech signal acquired prior to a speech signal to be processed at the current time point and using the echo speech model data to generate adapted acoustic model data; and recognition processing means for referring to the feature quantity, the adapted acoustic model data and the language model data to provide a speech recognition result of the speech signal.

The adapted acoustic model generating means in the present invention can comprise: a model data area transforming portion for transforming cepstrum acoustic model data into linear spectrum acoustic model data; and an echo prediction coefficient calculating portion for adding the echo speech model data to the linear spectrum acoustic model data to generate an echo prediction coefficient giving the maximum likelihood.

The present invention comprises an adding portion for generating echo speech model data, and the adding portion can add the cepstrum acoustic model data of the acoustic model and cepstrum acoustic model data of an intra-frame transfer characteristic to generate a “speech model affected by intra-frame echo influence”.

The adding portion in the present invention inputs the generated “speech model affected by intra-frame echo influence” into the model data area transforming portion and causes the model data area transforming portion to generate linear spectrum acoustic model data of the “speech model affected by intra-frame echo influence”.

The echo prediction coefficient calculating portion in the present invention can use at least one phoneme acquired from an inputted speech signal and the echo speech model data to maximize likelihood of the echo prediction coefficient based on linear spectrum speech model data. The speech recognition device in the present invention preferably performs speech recognition using a hidden Markov model.

According to the present invention, there is provided a speech recognition method for causing a speech recognition device configured to include a computer, for recognizing a speech, to perform speech recognition; the method causing the speech recognition device to execute steps of: storing in a storage area a feature quantity acquired from a speech signal for each frame; reading from the storing portion a speech signal acquired prior to a speech signal to be processed at the current time point to generate echo speech model data and processing speech model data stored in a storing portion to generate adapted acoustic speech model data and store it in a storage area; and reading the feature quantity, the adapted acoustic model data and language model data stored in a storing portion to generate a speech recognition result of the speech signal.

According to the present invention, the step of generating the adapted acoustic model data can comprise: an adding portion calculating the sum of the read speech signal and an intra-frame transfer characteristic value; and causing a model data area transforming portion to read the sum calculated by the adding portion to transform cepstrum acoustic model data into linear spectrum acoustic model data.

The present invention can comprise a step of causing an adding portion to read and add the linear spectrum acoustic model data and the echo speech model data to generate an echo prediction coefficient giving the maximum likelihood. In the present invention, the step of transformation into the linear spectrum acoustic model data can comprise a step of causing the adding portion to add the cepstrum acoustic model data of the acoustic model data and cepstrum acoustic model data of an intra-frame transfer characteristic to generate a “speech model affected by intra-frame echo influence”.

The step of generating the echo prediction coefficient in the present invention can comprise a step of determining the echo prediction coefficient so that the maximum likelihood is given, for at least one phoneme, to the sum value of the linear spectrum echo model data of the “speech model affected by intra-frame echo influence” and the echo speech model data, the sum value having been generated by the adding portion and stored.

In the present invention, there are provided a computer-readable program for causing a computer to execute the above-mentioned speech recognition methods and a computer-readable storage medium storing the computer-readable program.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will hereinafter be described in greater detail with reference to the appended drawings wherein:

FIG. 1 schematically illustrates speech recognition using a hidden Markov model (HMM);

FIG. 2 schematically illustrates a process for forming an output probability table based on each state for a speech signal;

FIG. 3 is a flowchart showing a schematic procedure for a speech recognition method of the present invention;

FIG. 4 shows schematic processing in the process described in FIG. 3;

FIG. 5 is a schematic block diagram of a speech recognition device of the present invention;

FIG. 6 shows a detailed configuration of an adapted acoustic model data generating portion used in the present invention;

FIG. 7 is a schematic flowchart showing a process of a speech recognition method to be performed by a speech recognition device of the present invention;

FIG. 8 shows an embodiment in which a speech recognition device of the present invention is configured as a notebook-type personal computer; and

FIG. 9 shows a typical method in which noises are taken into consideration for speech recognition.

DETAILED DESCRIPTION OF THE INVENTION

The present invention is now described according to the embodiment shown in the drawings. The present invention, however, is not limited to the embodiment described below.

A. Summary of Speech Recognition Using a Hidden Markov Model

FIG. 1 schematically illustrates speech recognition using a hidden Markov model (HMM) to be used in the present invention. An acoustic model can be regarded as an automaton in which a word or a sentence is constructed as a sequence of phonemes; three states are typically provided for each phoneme; and a transition probability among these states is specified so that a word or a sentence composed of a sequence of phonemes can be retrieved. In the embodiment shown in FIG. 1, there are illustrated three states S1 to S3. The transition probability Pr(S1|S0) is shown as 0.5, and the transition probability Pr(S3|S2) is shown as 0.3.

An output probability to be determined in association with a phoneme, given by a mixed Gaussian distribution, for example, is assigned to each of the states S1 to S3. In the embodiment shown in FIG. 1, it is shown that mixed elements k1 to k3 are used for the states S1 to S3. FIG. 1 also shows an output probability distribution of the mixed Gaussian distribution for the state S1, shown as k1 to k3. The mixed elements are provided with weights w1 to w3, respectively, to be suitably adapted to a particular speaker. When the above-mentioned acoustic model is used, the output probability is defined to be given by Pr(O|λ), where the letter O is a speech signal and λ is a set of HMM parameters.
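As a non-limiting illustration, the output probability of one state under a mixed Gaussian distribution can be sketched in Python as follows. The sketch assumes a one-dimensional feature quantity and three mixed elements; the function names and all numerical values are hypothetical and are not taken from FIG. 1.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def mixture_output_probability(x, weights, means, variances):
    """Output probability of a state: weighted sum over its mixed elements."""
    return sum(w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Hypothetical mixed elements k1 to k3 with weights w1 to w3 (weights sum to 1).
weights = [0.5, 0.3, 0.2]
means = [-1.0, 0.0, 2.0]
variances = [1.0, 0.5, 2.0]
likelihood = mixture_output_probability(0.3, weights, means, variances)
```

Adapting the weights to a particular speaker changes only the `weights` list in this sketch; the mixed elements themselves are left as they are.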

FIG. 2 shows a process for generating an output probability table according to the present invention. In the embodiment shown in FIG. 2, the output probability from the state S1 to the state S3 can be calculated by composing a trellis as shown in FIG. 2 using a feature quantity series {α β α} acquired from a speech signal and using an algorithm such as a Viterbi algorithm, a forward algorithm, a beam-search algorithm and the like. More generally, an output probability for a speech signal based on each state can be given as an output probability table by the formula (1) below, where t is a predetermined frame, O_t is a speech signal at the predetermined frame t, S_t is a state at the frame t, and λ is a set of HMM parameters.

[Equation 1]
$$\Pr(O \mid \lambda) = \sum_{\text{all } S} \prod_{t=1}^{T} \Pr(O_{t} \mid S_{t}, S_{t-1}, \lambda)\, \Pr(S_{t} \mid S_{t-1}, \lambda) \qquad (1)$$
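As a non-limiting illustration, the sum over all state series in the formula (1) can be evaluated with a forward algorithm over the trellis, sketched in Python below. The sketch assumes discrete feature quantities, an output probability that depends only on the current state (a simplification of Pr(O_t|S_t, S_t−1, λ)), and an initial distribution standing in for the first transition; the model values and the names `forward_probability` and `brute_force_probability` are hypothetical.

```python
from itertools import product


def forward_probability(obs, init, trans, emit):
    """Pr(O | lambda) by forward recursion over the trellis."""
    n = len(init)
    # alpha[j] = Pr(O_1 .. O_t, S_t = j)
    alpha = [init[j] * emit[j][obs[0]] for j in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
            for j in range(n)
        ]
    return sum(alpha)


def brute_force_probability(obs, init, trans, emit):
    """Same quantity, summed explicitly over every state series (formula (1))."""
    n = len(init)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total


# Hypothetical left-to-right model with three states and two symbols.
init = [1.0, 0.0, 0.0]
trans = [[0.5, 0.5, 0.0], [0.0, 0.7, 0.3], [0.0, 0.0, 1.0]]
emit = [
    {"alpha": 0.8, "beta": 0.2},
    {"alpha": 0.3, "beta": 0.7},
    {"alpha": 0.6, "beta": 0.4},
]
obs = ["alpha", "beta", "alpha"]  # stands in for the series {α β α}
p_forward = forward_probability(obs, init, trans, emit)
p_brute = brute_force_probability(obs, init, trans, emit)
```

The forward recursion gives the same value as the explicit sum while visiting each trellis node only once, which is why it is preferred over enumerating state series.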

In speech recognition using HMM, by using the above-mentioned output probability table to retrieve a phoneme string with the maximum likelihood, the output result, that is, a word or a sentence is determined. Though each state is described by Gaussian distribution, the state between the first phoneme and the last phoneme is determined by the likelihood based on the state transition probability. As for typical speech recognition using HMM, “digital signal processing for speech and sound information” by Shikano et al. (Sho-ko-do, ISBN 4-7856-2014) can be referred to, for example.

B. Process in a Speech Recognition Method According to the Present Invention

FIG. 3 shows a flowchart showing a schematic procedure of a speech recognition method of the present invention. As shown in FIG. 3, the process of the speech recognition method of the present invention receives input of a speech signal at step S10, and, at step S12, generates from acoustic model data and an intra-frame transfer characteristic a “speech model affected by intra-frame echo influence”. At step S14, an echo prediction coefficient α and a speech signal in the past are used to generate echo speech model data (α×O(ω; tp)).

The generated echo speech model data is, at step S16, added to the “speech model affected by intra-frame echo influence” given at step S12 as linear spectrum acoustic model data, and then an echo prediction coefficient α is so determined that the maximum likelihood value can be obtained for a selected word or sentence obtained by processing the speech signal. At step S18, the determined echo prediction coefficient α and the speech signal O(ω; tp) in a frame in the past are used to acquire the absolute value of an echo. The absolute value is added to the mean value vector μ of the speech model affected by intra-frame echo influence to calculate μ′=μ+α×O(ω; tp). A speech model which also includes outer-frame echo components is thereby generated and stored as a set with other parameters. After that, at step S20, the speech signal and the adapted acoustic model data are used to perform speech recognition, and at step S22, the recognition result is outputted.

FIG. 4 shows a schematic process for the processing described with reference to FIG. 3 of the present invention. First, acoustic model data and a cepstrum of an intra-frame transfer characteristic are added to create data of a “speech model affected by intra-frame echo influence”. By applying a method such as discrete Fourier transformation and exponentiation, the generated speech model data is transformed into linear spectrum acoustic model data. Furthermore, an echo prediction coefficient α is determined so that likelihood is maximized for the feature quantity of a phoneme included in the speech signal selected in the transformed spectrum data. Various methods can be used for the setting, and a predetermined word or a predetermined sentence, for example, may be appropriately used for the determination. The determined echo prediction coefficient α, together with acoustic model data originally stored in the speech recognition device, is used to create adapted acoustic model data. The acoustic model data within the generated linear spectrum area is logarithmically transformed and inverse Fourier transformed to be a cepstrum, and the cepstrum is stored to perform speech recognition.
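As a non-limiting illustration, the flow of FIG. 4 for a single mean value vector can be sketched in Python as follows. The sketch assumes that a cepstrum is an orthonormal discrete cosine transform of a log spectrum (the embodiment itself refers to Fourier transformation), that all vectors have the same dimension, and that the linear spectra and α are non-negative; the function names and the example values are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix; its transpose is its inverse."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def cepstrum_to_linear(cep):
    """Cepstrum area -> linear spectrum area (inverse transform, then exp)."""
    c_t = [list(col) for col in zip(*dct_matrix(len(cep)))]
    return [math.exp(x) for x in matvec(c_t, cep)]


def linear_to_cepstrum(spec):
    """Linear spectrum area -> cepstrum area (log, then transform)."""
    return matvec(dct_matrix(len(spec)), [math.log(x) for x in spec])


def adapt_cepstral_mean(s_cep, h_cep, alpha, prev_frame_lin):
    """Return the adapted cepstral mean for one output probability distribution."""
    # Add the intra-frame transfer characteristic in the cepstrum area.
    sh_cep = [s + h for s, h in zip(s_cep, h_cep)]
    # Transform the sum into the linear spectrum area.
    sh_lin = cepstrum_to_linear(sh_cep)
    # Add the echo speech model data: mu' = mu + alpha * O(omega; tp).
    adapted_lin = [m + alpha * o for m, o in zip(sh_lin, prev_frame_lin)]
    # Return to the cepstrum area for storage and recognition.
    return linear_to_cepstrum(adapted_lin)


# Hypothetical four-dimensional example.
s_cep = [1.0, 0.2, -0.1, 0.05]
h_cep = [0.3, -0.1, 0.0, 0.02]
prev_frame_lin = [1.0, 2.0, 0.5, 1.5]
adapted = adapt_cepstral_mean(s_cep, h_cep, 0.3, prev_frame_lin)
```

With α set to 0 the sketch returns the plain sum of the two cepstra, which corresponds to the speech model affected only by intra-frame echo influence.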

A case where a speech signal is a speech including an echo is now considered. It is known that, when an echo is superimposed onto speech, the speech signal O′(ω; t) with a frequency ω and a frame number t, which is observed at the current time point, is shown by the formula (2) below using a speech signal in the past O(ω; tp) (“A method of reverberation compensation based on short time spectral analysis” by Nakamura, Takiguchi, and Shikano, Proceedings of the meeting of the Acoustical Society of Japan, March 1998, 3-6-11).

[Equation 2]
$$O'(\omega; t) \cong S(\omega; t) \cdot H(\omega) + \alpha \cdot O(\omega; t-1) = \exp\left[\cos\{S_{cep}(c; t) + H_{cep}(c)\}\right] + \alpha \cdot O(\omega; t-1) \qquad (2)$$

In the above formula, a standard acoustic model generated with a speech corpus and the like can be used for S in the present invention, and this is referred to as a clean speech signal in the present invention. A prediction value for the transfer characteristic in the same frame is used for H. The α is an echo prediction coefficient showing the rate of an echo to be imposed from a frame in the past to the frame to be evaluated at the current time point. The subscript “cep” indicates a cepstrum.

In the present invention, acoustic model data used for speech recognition is used instead of the reference signal which has conventionally been required. Furthermore, the intra-frame transfer characteristic H is acquired as a prediction value, and an echo prediction coefficient is determined using a speech signal selected based on the maximum likelihood criterion to generate adapted acoustic model data.

When an echo is superimposed, the inputted speech signal and the acoustic model data are different only by the echo. In the present invention, attention has been focused on the fact that, in consideration of the long impulse response, an echo can be sufficiently simulated even if the echo is assumed to be superimposed onto a speech signal O(ω; t) to be determined at the current time point while being dependent on a speech signal O(ω; tp) in the immediately previous frame. That is, by using the formula (2) above to determine acoustic model data with the highest likelihood for a speech signal from predetermined acoustic model data and the value of α, it is possible to use corresponding language model data to perform speech recognition using only a speech signal from one channel.

Though an intra-frame transfer characteristic H is applied to acoustic model data by convolution (a product S·H in the spectrum area, as in the formula (2)), transformation into a cepstrum area enables an addition condition to be satisfied. Therefore, if the intra-frame transfer characteristic H can be estimated by another method, it is possible to use additivity with acoustic model data to easily and accurately determine acoustic model data, which takes the intra-frame transfer characteristic H into consideration, through addition to data in the cepstrum area of acoustic model data already registered.
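For reference, the additivity relied upon here can be written out as follows. This is a sketch which assumes that the cepstrum is obtained by applying a linear transform to the logarithm of the spectrum; the symbols are those of the formula (2).

$$\log\{S(\omega; t) \cdot H(\omega)\} = \log S(\omega; t) + \log H(\omega)$$

Since the transform from the log spectrum to the cepstrum is linear, the cepstrum of the product is the sum S_cep(c; t) + H_cep(c), which is the argument that appears inside the formula (2).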

A set of parameters for an HMM of a clean speech signal S is indicated by λ_((s), cep), a set of HMM parameters for the intra-frame transfer characteristic H is indicated by λ_((h′), cep), and a set of HMM parameters for adapted acoustic model data is indicated by λ_((O), cep). In the present invention, attention is paid only to output probability distribution among acoustic model data, and λ_((s)) is shown as λ_((s))={μ_(j,k), σ² _((s)j,k), W_(j,k)}, where μ_(j,k) is the mean value of the k-th output probability distribution of a state j of a predetermined HMM, σ² _((s)j,k) is the variance, and W_(j,k) is the weight. These HMM parameters for acoustic model data are usually expressed as a cepstrum, which is regarded as most suitable for speech recognition, and applied to speech recognition.

As for estimation of an intra-frame transfer characteristic at step S12 in FIG. 3, in a particular embodiment of the present invention, for example, an intra-frame transfer function H can be used, which is acquired in the method described in “HMM-Separation-Based Speech Recognition for a Distant Moving Speaker” by T. Takiguchi, et al., IEEE Trans. on SAP, Vol. 9, No. 2, 2001, when it is assumed for convenience that there is no echo and α=0 is set. The intra-frame transfer function created can be subjected to discrete Fourier transformation and exponentiation, then transformed to a cepstrum area, and stored in a storage area.

Furthermore, various methods can be used when the echo prediction coefficient α is calculated based on likelihood. In the particular embodiment described in the present invention, an EM algorithm (“An inequality and associated maximization technique in statistical estimation of probabilistic function of a Markov process”, Inequalities, Vol. 3, pp. 1-8, 1972) can be used to calculate a prediction value for the maximum likelihood α′.

Calculation processing of an echo prediction coefficient α using the EM algorithm is performed by using the E step and the M step of the EM algorithm. In the present invention, a set of HMM parameters transformed into a linear spectrum area is used to calculate at the E step the Q function shown by the formula (3) below.

[Equation 3]
$$Q(\alpha' \mid \alpha) = E\left[\log \Pr(O, s, k \mid \lambda_{(SH),lin}, \alpha') \,\middle|\, \lambda_{(SH),lin}, \alpha\right] = \sum_{p}\sum_{n}\sum_{s_{p,n}}\sum_{m_{p,n}} \frac{\Pr(O_{p,n}, s_{p,n}, m_{p,n} \mid \lambda_{(SH),lin}, \alpha)}{\Pr(O_{p,n} \mid \lambda_{(SH),lin}, \alpha)} \cdot \log \Pr(O_{p,n}, s_{p,n}, m_{p,n} \mid \lambda_{(SH),lin}, \alpha') \qquad (3)$$

In the above formula, the index of an HMM parameter (indicating a predetermined phoneme, for example) is indicated by p, the n-th observation series related to a phoneme p is indicated by O_(p,n), and a state series and a mixed element series for each O_(p,n) are indicated by s_(p,n) and m_(p,n). The mean value, variance and weight of the k-th output probability distribution (mixed Gaussian distribution) of a state j of a phoneme p of λ_((SH), lin) are shown as the expression (4) below.

[Equation 4]
$$\{\mu_{(SH),p,j,k},\ \sigma^{2}_{(SH),p,j,k},\ W_{(SH),p,j,k}\} \qquad (4)$$

When the number of dimensions for each is indicated by D, if attention is paid only to the output probability distribution of the above Q function, then the Q function is shown as the formula (5) below.

[Equation 5]
$$Q(\alpha' \mid \alpha) = -\sum_{p}\sum_{n}\sum_{j}\sum_{k}\sum_{t} \gamma_{p,n,j,k,t} \left\{ \frac{1}{2}\log (2\pi)^{D}\sigma^{2}_{(SH),p,j,k} + \frac{\{O_{p,n}(t) - \mu_{(SH),p,j,k} - \alpha' \cdot O_{p,n}(t-1)\}^{T}\{O_{p,n}(t) - \mu_{(SH),p,j,k} - \alpha' \cdot O_{p,n}(t-1)\}}{2\sigma^{2}_{(SH),p,j,k}} \right\} \qquad (5)$$

In the above formula, the frame number is indicated by t. The γ_(p,n,j,k,t) is a probability given by the formula (6) below.

[Equation 6]
$$\gamma_{p,n,j,k,t} = \Pr(O_{p,n}(t), j, k \mid \lambda_{(SH),lin}, \alpha) \qquad (6)$$

The Q function is then maximized relative to α′ at the M step (maximization) in the EM algorithm.

[Equation 7]
$$\alpha' = \underset{\alpha'}{\operatorname{argmax}}\ Q(\alpha' \mid \alpha) \qquad (7)$$

The maximum likelihood α′ can be obtained by partially differentiating the obtained Q by α′ to determine the maximum value. As a result, the α′ is given by the formula (8) below.

[Equation 8]
$$\alpha' = \frac{\displaystyle\sum_{p}\sum_{n}\sum_{j}\sum_{k}\sum_{t} \gamma_{p,n,j,k,t}\, \frac{O_{p,n}(t) \cdot O_{p,n}(t-1) - O_{p,n}(t-1) \cdot \mu_{(SH),p,j,k}}{\sigma^{2}_{(SH),p,j,k}}}{\displaystyle\sum_{p}\sum_{n}\sum_{j}\sum_{k}\sum_{t} \gamma_{p,n,j,k,t}\, \frac{O_{p,n}^{2}(t-1)}{\sigma^{2}_{(SH),p,j,k}}} \tag{8}$$
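The step from the formula (5) to the formula (8) can be written out as follows. This is a sketch that treats each dimension independently with a scalar variance, an assumption read off from the form of the formula (8) rather than stated in the text. Setting the partial derivative of Q with respect to α′ to zero gives

$$\frac{\partial Q(\alpha' \mid \alpha)}{\partial \alpha'} = \sum_{p}\sum_{n}\sum_{j}\sum_{k}\sum_{t} \gamma_{p,n,j,k,t}\, \frac{\{O_{p,n}(t) - \mu_{(SH),p,j,k} - \alpha' \cdot O_{p,n}(t-1)\} \cdot O_{p,n}(t-1)}{\sigma^{2}_{(SH),p,j,k}} = 0$$

Collecting the terms that contain α′ on one side yields

$$\alpha' \sum_{p}\sum_{n}\sum_{j}\sum_{k}\sum_{t} \gamma_{p,n,j,k,t}\, \frac{O_{p,n}^{2}(t-1)}{\sigma^{2}_{(SH),p,j,k}} = \sum_{p}\sum_{n}\sum_{j}\sum_{k}\sum_{t} \gamma_{p,n,j,k,t}\, \frac{O_{p,n}(t) \cdot O_{p,n}(t-1) - O_{p,n}(t-1) \cdot \mu_{(SH),p,j,k}}{\sigma^{2}_{(SH),p,j,k}}$$

and dividing by the sum on the left-hand side gives the closed form for α′. Because the second derivative of Q with respect to α′ is the negative of that left-hand sum, and therefore not positive, the stationary point is a maximum.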

In the present invention, the α′ can be estimated for each phoneme p. In this case, as given by the formula (9) below, the α′ for each phoneme can be acquired by using a value before calculating the sum for the phoneme p.

[Equation 9]
$$\alpha'_{p} = \frac{\displaystyle\sum_{n}\sum_{j}\sum_{k}\sum_{t} \gamma_{p,n,j,k,t}\, \frac{O_{p,n}(t) \cdot O_{p,n}(t-1) - O_{p,n}(t-1) \cdot \mu_{(SH),p,j,k}}{\sigma^{2}_{(SH),p,j,k}}}{\displaystyle\sum_{n}\sum_{j}\sum_{k}\sum_{t} \gamma_{p,n,j,k,t}\, \frac{O_{p,n}^{2}(t-1)}{\sigma^{2}_{(SH),p,j,k}}} \tag{9}$$

Which echo prediction coefficient is to be used can be determined according to the particular device and to requirements such as recognition efficiency and recognition speed. It is also possible to determine α′ for each HMM state in a manner similar to the formulae (8) and (9). By performing the calculation processing described above, an echo prediction coefficient α can be acquired only from a speech signal O(t) inputted through a single channel at a distance from the speaker, using only parameters of the original acoustic model.
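The M-step updates of the formulae (8) and (9) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each sample is a scalar (one spectral dimension), the occupation probabilities γ are taken as already computed at the E step, and the function name `echo_coefficients` and the tuple layout are hypothetical, not taken from the source.

```python
from collections import defaultdict


def echo_coefficients(samples):
    """Estimate the echo prediction coefficient globally and per phoneme.

    samples: iterable of (phoneme, gamma, o_t, o_prev, mu, var) tuples, where
    gamma is the occupation probability, o_t and o_prev are the current and
    past observations, and mu, var are the model mean and variance.
    (Hypothetical layout chosen for this sketch.)
    """
    num = defaultdict(float)  # numerator sums, kept per phoneme
    den = defaultdict(float)  # denominator sums, kept per phoneme
    for phoneme, gamma, o_t, o_prev, mu, var in samples:
        num[phoneme] += gamma * (o_t * o_prev - o_prev * mu) / var
        den[phoneme] += gamma * o_prev * o_prev / var

    total_den = sum(den.values())
    if total_den == 0.0:
        raise ValueError("past observations are all zero; alpha is undefined")

    # Per-phoneme estimate: the ratio taken before summing over phonemes.
    per_phoneme = {p: num[p] / den[p] for p in num if den[p] > 0.0}
    # Global estimate: the ratio taken after summing over phonemes.
    global_alpha = sum(num.values()) / total_den
    return global_alpha, per_phoneme


# Synthetic data generated as o_t = mu + alpha * o_prev (no noise),
# with alpha = 0.3 for phoneme "a" and alpha = 0.5 for phoneme "b".
samples = [
    ("a", 1.0, 2.0 + 0.3 * 1.0, 1.0, 2.0, 0.5),
    ("a", 0.5, 2.0 + 0.3 * 4.0, 4.0, 2.0, 0.5),
    ("b", 1.0, 1.0 + 0.5 * 2.0, 2.0, 1.0, 2.0),
]
global_alpha, per_phoneme = echo_coefficients(samples)
```

Keeping the numerator and denominator sums per phoneme lets the same pass produce both estimates; the global value is a weighted average of the per-phoneme values, so it always lies between the smallest and the largest of them.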

C: Speech Recognition Device of the Present Invention and a Processing Method Thereof

FIG. 5 shows a schematic block diagram of a speech recognition device of the present invention. The speech recognition device 10 of the present invention is generally configured with a computer including a central processing unit (CPU). As shown in FIG. 5, the speech recognition device 10 of the present invention comprises a speech signal acquiring portion 12, a feature quantity extracting portion 14, a recognition processing portion 16 and an adapted acoustic model data generating portion 18. The speech signal acquiring portion 12 transforms a speech signal inputted from inputting means such as a microphone (not shown) into a digital signal with an A/D converter and the like, and stores it in a suitable storage area 20 with its amplitude associated with a time frame. The feature quantity extracting portion 14 is configured to include a model data area transforming portion 22.

The model data area transforming portion 22 comprises Fourier transformation means (not shown), indexation means and inverse Fourier transformation means. The model data area transforming portion 22 reads a speech signal stored in the storage area 20 to generate a cepstrum of the speech signal, and stores it in a suitable area of the storage area 20. The feature quantity extracting portion 14 acquires a feature quantity series from the generated cepstrum of the speech signal and stores it in association with a frame.
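One common way to obtain a cepstrum from a frame is a Fourier transform, a logarithm of the magnitude spectrum, and an inverse Fourier transform. The source names the stages of the transforming portion 22 without giving formulas, so the sketch below is an assumption about the general technique and not the device's actual implementation; the function names and the `floor` parameter are hypothetical. A plain O(n²) DFT is used so that the example needs only the standard library.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def real_cepstrum(frame, floor=1e-12):
    """Real cepstrum: inverse DFT of the log magnitude spectrum.

    `floor` (hypothetical) avoids log(0) for spectral bins that are exactly zero.
    """
    log_magnitude = [math.log(max(abs(v), floor)) for v in dft(frame)]
    return [c.real for c in idft(log_magnitude)]


# A unit impulse has a flat magnitude spectrum of 1, so its cepstrum is zero.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
cepstrum_impulse = real_cepstrum(impulse)

# Scaling the frame by 2 adds log(2) to the zeroth coefficient only.
cepstrum_scaled = real_cepstrum([2.0 * v for v in impulse])
```

The second example shows why multiplicative effects in the linear spectrum become additive terms in the cepstrum.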

The speech recognition device 10 shown in FIG. 5 is configured to further include an acoustic model data storing portion 24 for storing acoustic model data based on an HMM, which has been generated with the use of a speech corpus and the like, a language model data storing portion 26 for storing language model data acquired from a text corpus and the like, and an adapted acoustic model data storing portion 28 for storing adapted acoustic model data generated by the present invention.

The recognition processing portion 16, in the present invention, is configured to read adapted acoustic model data from the adapted acoustic model data storing portion 28, read language model data from the language model data storing portion 26, and use likelihood maximization to perform speech recognition with the read data, based on the cepstrum of the speech signal.

Each of the acoustic model data storing portion 24, the language model data storing portion 26 and the adapted acoustic model data storing portion 28 may be a database constructed in a storage device such as a hard disk. The adapted acoustic model data generating portion 18 shown in FIG. 5 creates adapted acoustic model data through the above-mentioned processing in the present invention, and causes it to be stored in the adapted acoustic model data storing portion 28.

FIG. 6 shows a detailed configuration of the adapted acoustic model data generating portion 18 to be used in the present invention. As shown in FIG. 6, the adapted acoustic model data generating portion 18 is configured to include a buffer memory 30, model data area transforming portions 32a and 32b, an echo prediction coefficient calculating portion 34, adding portions 36a and 36b, and a generating portion 38. The adapted acoustic model data generating portion 18 reads predetermined observation data older than the frame to be processed at the current time point, multiplies it by an echo prediction coefficient α, and stores the result in the buffer memory 30. At the same time, the adapted acoustic model data generating portion 18 reads acoustic model data from the acoustic model data storing portion 24, reads from the storage area 20 the cepstrum acoustic model data of the intra-frame transfer characteristic H, which has been calculated in advance, and writes both to the buffer memory 30.

Since both the acoustic model data stored in the buffer memory 30 and the intra-frame transfer characteristic data are cepstrum acoustic model data, these data are read into the adding portion 36a and added to generate a “speech model affected by intra-frame echo influence”. The “speech model affected by intra-frame echo influence” is sent to the model data area transforming portion 32a to be transformed into linear spectrum acoustic model data, and is then sent to the adding portion 36b. The adding portion 36b reads the data obtained by multiplying past observation data by an echo prediction coefficient and adds it to the linear spectrum acoustic model data of the “speech model affected by intra-frame echo influence”.

The addition data generated at the adding portion 36b is sent to the echo prediction coefficient calculating portion 34, which stores acoustic model data corresponding to a phoneme and the like selected in advance, to determine an echo prediction coefficient α so that the likelihood is maximal, using an EM algorithm. The determined echo prediction coefficient α is passed to the generating portion 38 together with acoustic model data that has been stored after being transformed into linear spectrum acoustic model data or is still held in linear spectrum form, and adapted acoustic model data is created. The generated adapted acoustic model data is sent to the model data area transforming portion 32b, and is transformed from linear spectrum acoustic model data into cepstrum acoustic model data. After that, it is stored in the adapted acoustic model data storing portion 28.
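The data flow through the adding portions and the model data area transforming portions can be sketched for a single mean vector as follows. This is a minimal illustration under several assumptions not fixed by the source: an orthonormal DCT stands in for the transform between cepstrum and log spectrum, only the mean is adapted (variances are left out), the echo prediction coefficient α is taken as already determined, and all function names are hypothetical.

```python
import math


def _scale(k, n):
    # Normalization that makes the DCT-II matrix orthonormal.
    return math.sqrt((1.0 if k == 0 else 2.0) / n)


def dct(x):
    """Orthonormal DCT-II (log spectrum -> cepstrum in this sketch)."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]


def idct(c):
    """Inverse of the orthonormal DCT-II (cepstrum -> log spectrum)."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
            for i in range(n)]


def cep_to_lin(cep):
    """Cepstrum -> linear spectrum (role of transforming portion 32a)."""
    return [math.exp(v) for v in idct(cep)]


def lin_to_cep(lin):
    """Linear spectrum -> cepstrum (role of transforming portion 32b)."""
    return dct([math.log(v) for v in lin])


def adapt_mean(speech_cep, transfer_cep, alpha, past_obs_lin):
    # Adding portion 36a: cepstra of the clean model and of H add.
    sh_cep = [s + h for s, h in zip(speech_cep, transfer_cep)]
    # Transforming portion 32a: move to the linear spectrum area.
    sh_lin = cep_to_lin(sh_cep)
    # Adding portion 36b: add the past observation scaled by alpha.
    adapted_lin = [m + alpha * o for m, o in zip(sh_lin, past_obs_lin)]
    # Transforming portion 32b: back to the cepstrum area for recognition.
    return lin_to_cep(adapted_lin)


speech_cep = [1.0, 0.2, -0.1, 0.05]
transfer_cep = [0.1, 0.0, 0.05, -0.02]
past_obs_lin = [1.0, 2.0, 0.5, 1.5]
no_echo = adapt_mean(speech_cep, transfer_cep, 0.0, past_obs_lin)
with_echo = adapt_mean(speech_cep, transfer_cep, 0.3, past_obs_lin)
```

With α = 0 the result reduces to the cepstral sum of the speech model and the transfer characteristic, which is a convenient sanity check on the two domain transforms.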

FIG. 7 is a schematic flowchart showing a process of a speech recognition method to be performed by a speech recognition device of the present invention. As shown in FIG. 7, at step S30, the recognition process to be performed by the speech recognition device of the present invention acquires a speech signal superposed with an echo for each frame and stores in a suitable storage area at least the frame to be processed at the current time point and a preceding frame. At step S32, the process extracts a feature quantity from the speech signal, acquires data to be used for retrieval of the speech signal based on acoustic model data and language model data, and stores the data as cepstrum acoustic model data in a suitable storage area.

At step S34, which can be performed in parallel with step S32, a speech signal in a past frame and acoustic model data are read from a suitable storage area, transformation into a cepstrum area and transformation into a linear spectrum area are done to create adapted acoustic model data, and the data are stored in a suitable storage area in advance. At step S36, the adapted acoustic model data and the feature quantity acquired from the speech signal are used to determine a phoneme to which the maximum likelihood is to be given. At step S38, language model data are used based on the determined phoneme to generate a recognition result, and the result is stored in a suitable storage area. At the same time, the sum of likelihoods at the current time point is stored. After that, at step S40, it is determined whether there remains a frame to be processed. If there is no frame to be processed (no), then a word or a sentence for which the sum of likelihoods is maximal is outputted as a recognition result at step S42. If there is any frame yet to be processed (yes), then at step S44, observation data for the remaining frame is read, and a feature quantity is extracted. The process is then returned to step S36, and recognition of the word or sentence is completed by repetition of the process.
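The loop formed by steps S36 to S44 (score each frame, accumulate likelihoods, output the candidate with the maximal sum once no frame remains) can be illustrated with a deliberately simplified sketch. It is a toy under loud assumptions: one scalar feature per frame, a fixed frame-to-model alignment instead of an HMM state search, no language model, and hypothetical names throughout.

```python
import math


def gaussian_log_likelihood(x, mean, std=0.25):
    """Log density of a scalar Gaussian; std is an arbitrary toy value."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def recognize(frames, word_models):
    """Accumulate per-frame log-likelihoods and return the best candidate.

    word_models maps each candidate word to the expected feature value per
    frame (a stand-in for adapted acoustic model data).
    """
    totals = {word: 0.0 for word in word_models}
    for t, feature in enumerate(frames):           # loop while frames remain
        for word, means in word_models.items():    # score frame against each model
            totals[word] += gaussian_log_likelihood(feature, means[t])
    best = max(totals, key=totals.get)             # maximal sum of likelihoods
    return best, totals


word_models = {"yes": [0.0, 1.0, 0.0], "no": [1.0, 0.0, 1.0]}
best_word, totals = recognize([0.1, 0.9, 0.2], word_models)
```

Summing log-likelihoods rather than multiplying probabilities keeps the running total numerically stable over many frames.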

FIG. 8 shows an embodiment in which a speech recognition device of the present invention is configured as a notebook-type personal computer 40. An internal microphone 42 is arranged at the upper side of the display part of the notebook-type personal computer 40 to receive speech input from a user. The user, in an office or at home, moves a cursor displayed on the display part with pointer means 44 such as a mouse or a touch pad to perform various kinds of processing.

It is now assumed that a user desires to perform dictation with word-processor software, using speech recognition software by IBM Corporation (ViaVoice: registered trademark), for example. When the user places the mouse cursor on an application icon 46 for activating application software and clicks the mouse 44, the word-processor software is activated at the same time that ViaVoice is activated. In the particular embodiment of the present invention, a speech recognition program of the present invention is incorporated in the ViaVoice software as a module.

Conventionally, a user uses a head-set type microphone or a hand microphone to avoid the influence of echoes and environmental noises when inputting speech. Furthermore, the user is required to input environmental noises or echoes separately from the input speech. However, according to the speech recognition method using the notebook-type personal computer 40 shown in FIG. 8, the user can perform dictation through speech recognition only by input into the internal microphone 42.

Though FIG. 8 shows an embodiment in which the present invention is applied to a notebook-type personal computer, the present invention is also applicable, in addition to the processing shown in FIG. 8, to speech-interaction type processing in a relatively small space where the influence of echoes is larger than that of continuous superposition of environmental noises, such as a kiosk device for performing speech-interaction type processing in a relatively small partitioned room, dictation in a car or a plane, command recognition and the like. Furthermore, the speech recognition device of the present invention is capable of communicating via a network with another server computer performing non-speech processing or with a server computer suitable for speech processing. The network described above includes the Internet using a communication infrastructure such as a local area network (LAN), a wide area network (WAN), optical communication, ISDN, and ADSL.

In the speech recognition method of the present invention, only speech signals continuously inputted in chronological order are used, and extra processing steps for separately storing and processing a reference signal using multiple microphones and hardware resources for the extra steps are not required. Furthermore, availability of speech recognition can be expanded without use of a head-set type microphone or a hand microphone for acquiring a reference signal as a “speech model affected by intra-frame echo influence”.

Though the present invention has been described based on a particular embodiment shown in the drawings of the present invention, it is not limited to the described particular embodiment. Each functional portion or functional means is implemented by causing a computer to execute a program, and is not necessarily required to be incorporated as a component for each functional block shown in the drawings. Furthermore, as a computer-readable programming language for configuration of a speech recognition device of the present invention, the assembler language, FORTRAN, the C language, the C++ language, Java® and the like are included. A computer-executable program for causing a speech recognition method of the present invention to be executed can be stored in a ROM, EEPROM, flash memory, CD-ROM, DVD, flexible disk, hard disk and the like for distribution.

D: Embodiment Example

The present invention is now described using a concrete example. An impulse response actually measured in a room was used to create speech under echoes. A frame value corresponding to 300 msec was used as an echo time for the embodiment example, a reference example and the comparison examples. The distance between a sound source and a microphone was set to be 2 m, and a speaking voice was inputted into the microphone from its front side. A sampling frequency of 12 kHz, a window width of 32 msec, and an analysis period of 8 msec were used as signal analysis conditions. A sixteen-dimensional MFCC (Mel Frequency Cepstral Coefficient) was used as an acoustic feature quantity.
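The signal analysis conditions above translate into sample counts and frame lags by simple arithmetic, sketched here. The variable names are illustrative only, and the rule that a past window must lag by at least the window width divided by the analysis period to avoid overlap is a general framing fact, not a formula quoted from the source.

```python
sampling_hz = 12_000   # sampling frequency: 12 kHz
window_ms = 32         # analysis window width
period_ms = 8          # analysis period (frame shift)

# Samples covered by one analysis window and by one frame shift.
window_samples = sampling_hz * window_ms // 1000
period_samples = sampling_hz * period_ms // 1000

# Smallest frame lag at which a past window no longer overlaps the
# current one: ceil(window / period), written with integer arithmetic.
min_non_overlap_lag = -(-window_ms // period_ms)

print(window_samples, period_samples, min_non_overlap_lag)  # prints: 384 96 4
```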

Since 8 msec was set for the analysis period and the window width was 32 msec, a past speech signal displaced by four frames (32 msec / 8 msec) was used for processing of an echo signal in order to prevent windows from overlapping each other. For each of the embodiment example, the reference example, and the comparison examples, an input speech signal to be used was generated with fifty-five phonemes. As for calculation of an echo prediction coefficient α, the maximum likelihood was calculated with the use of phonemes for one word among the inputted signals. The obtained echo prediction coefficient α was applied to all the speech recognitions. The recognition success rates obtained when five hundred words were recognized are shown below.

TABLE 1

                          Embodiment      Reference         Comparison  Comparison
                          example         example           example 1   example 2
Method                    This invention  Takiguchi et al.  CMS         Without echo compensation
Recognition success rate  92.8%           91.2%             86.0%       54.8%

As shown in Table 1 above, the result of the case without echo compensation (comparison example 2) was 54.8%. By comparison, the recognition success rate was improved to 92.8% by the present invention (embodiment example). This result is slightly better than the result of the reference example by Takiguchi et al. (the above-mentioned “HMM-Separation-Based Speech Recognition for a Distant Moving Speaker” by T. Takiguchi, et al., IEEE Trans. on SAP, Vol. 9, pp. 127-140, No. 2, 2001), in which a reference signal and two-channel data are used. In the comparison example 1, in which the CMS method (Cepstral Mean Subtraction method) is used, the recognition success rate was 86.0%, which is lower than the success rate of the embodiment example of the present invention. That is, the result shows that the present invention can provide a recognition success rate better than that of conventional methods even though only one-channel data is used.

1) A speech recognition device configured to include a computer, the speech recognition device comprising: a storage area for storing a feature quantity acquired from a speech signal for each frame; storing portions for storing acoustic model data and language model data, respectively; an echo adaptation model generating portion for generating echo speech model data from a speech signal acquired prior to a speech signal to be processed at the current time point and using the echo speech model data to generate adapted acoustic model data; and recognition processing means for utilizing said feature quantity, said adapted acoustic model data and said language model data to provide a speech recognition result of the speech signal.

2) The speech recognition device according to claim 1; wherein said adapted acoustic model generating means comprises: a model data area transforming portion for transforming cepstrum acoustic model data into linear spectrum acoustic model data; and an echo prediction coefficient calculating portion for adding said echo speech model data to said linear spectrum acoustic model data to generate an echo prediction coefficient giving the maximum likelihood.

3) The speech recognition device according to claim 2, further comprising an adding portion for generating echo speech model data; wherein said adding portion adds the cepstrum acoustic model data of said acoustic model and cepstrum acoustic model data of an intra-frame transfer characteristic to generate a speech model affected by intra-frame echo influence.

4) The speech recognition device according to claim 3; wherein said adding portion inputs said generated speech model affected by intra-frame echo influence into said model data area transforming portion and causes said model data area transforming portion to generate linear spectrum acoustic model data of said speech model affected by intra-frame echo influence.

5) The speech recognition device according to claim 4; wherein said echo prediction coefficient calculating portion uses at least one phoneme acquired from an inputted speech signal and said echo speech model data to maximize likelihood of the echo prediction coefficient based on linear spectrum speech model data.

6) The speech recognition device according to claim 5; performing speech recognition using a hidden Markov model.

7) A speech recognition method for causing a speech recognition device configured to include a computer to perform speech recognition; the method causing the speech recognition device to execute steps of: storing in a storage area a feature quantity acquired from a speech signal for each frame; reading from said storing portion a speech signal acquired prior to a speech signal to be processed at the current time point to generate echo speech model data; processing a speech model stored in a storing portion to generate adapted acoustic speech model data and store it in a storage area; and processing said feature quantity, said adapted acoustic model data, and language model data stored in a storing portion to generate a speech recognition result of the speech signal.

8) The speech recognition method according to claim 7; wherein the step of generating said adapted acoustic model data comprises steps of: an adding portion calculating the sum of said read speech signal and an intra-frame transfer characteristic value; and a model data area transforming portion to read said sum calculated by said adding portion to transform cepstrum acoustic model data into linear spectrum acoustic model data.

9) The speech recognition method according to claim 8, further comprising a step of: causing an adding portion to read and add said linear spectrum acoustic model data and said echo speech model data to generate an echo prediction coefficient giving the maximum likelihood.

10) The speech recognition method according to claim 9; wherein the step of transformation into said linear spectrum acoustic model data comprises a step of causing said adding portion to add the cepstrum acoustic model data of said acoustic model and cepstrum acoustic model data of an intra-frame transfer characteristic to generate a speech model affected by intra-frame echo influence.

11) The speech recognition device according to claim 10: wherein the step of generating said echo prediction coefficient comprises a step of determining the echo prediction coefficient so that the maximum likelihood is given to at least one phoneme for which the sum value of the linear spectrum echo model data of said speech model affected by intra-frame echo influence and said echo speech model data, which has been generated by said adding portion and stored.

12) A computer-readable program for causing a computer to execute the speech recognition method comprising the steps of: storing in a storage area a feature quantity acquired from a speech signal for each frame; reading from said storing portion a speech signal acquired prior to a speech signal to be processed at the current time point to generate echo speech model data; processing a speech model stored in a storing portion to generate adapted acoustic speech model data and store it in a storage area; and processing said feature quantity, said adapted acoustic model data, and language model data stored in a storing portion to generate a speech recognition result of the speech signal.

13) A storage medium storing a computer-readable program for causing a computer to execute a speech recognition method, said method comprising the steps of: storing in a storage area a feature quantity acquired from a speech signal for each frame; reading from said storing portion a speech signal acquired prior to a speech signal to be processed at the current time point to generate echo speech model data; processing a speech model stored in a storing portion to generate adapted acoustic speech model data and store it in a storage area; and processing said feature quantity, said adapted acoustic model data, and language model data stored in a storing portion to generate a speech recognition result of the speech signal.